Loading in packages, removing objects from the environment and setting the working directory
if (!require("pacman")) install.packages("pacman"); library(pacman)
pacman::p_load(tidyverse, caTools, lubridate,
ggthemes, janitor, corrplot, GGally, psych)
rm(list=ls())
setwd("/Users/Jesse/R/MedInsuranceProject/")
Reading in the data
df <- read_csv("/Users/Jesse/R/MedInsuranceProject/insurance.csv")
Let’s take a took at the data
head(df)
## # A tibble: 6 x 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
str(df)
## spec_tbl_df [1,338 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:1338] 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr [1:1338] "female" "male" "male" "male" ...
## $ bmi : num [1:1338] 27.9 33.8 33 22.7 28.9 ...
## $ children: num [1:1338] 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr [1:1338] "yes" "no" "no" "no" ...
## $ region : chr [1:1338] "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num [1:1338] 16885 1726 4449 21984 3867 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_character(),
## .. bmi = col_double(),
## .. children = col_double(),
## .. smoker = col_character(),
## .. region = col_character(),
## .. charges = col_double()
## .. )
Let’s change the character columns to factors
df$sex <- as.factor(df$sex)
df$smoker <- as.factor(df$smoker)
df$region <- as.factor(df$region)
Finally, let’s check for any missing values
any(is.na(df))
## [1] FALSE
No missing data! Let’s move on to some exploratory data analysis
summary(df)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
Let’s take a look at a histogram of the insurance charges
df %>% ggplot(aes(charges)) + geom_histogram(aes(y=stat(density)), fill = "light blue", color = "white", bins = 30) +
ggtitle("Distribution of Charges") + theme_bw() + theme(plot.title = element_text(hjust = .5)) +
geom_density(col="dark blue")
Let’s make the distribution closer to normal
df %>% ggplot(aes(log10(charges))) + geom_histogram(aes(y=stat(density)), fill = "light blue", color = "white", bins = 30) +
ggtitle("Distribution of Charges") + theme_bw() + theme(plot.title = element_text(hjust = .5)) +
geom_density(col = "dark blue")
Now let’s look at charges by region
df %>% ggplot(aes(region, charges)) + geom_boxplot(fill=c(3,4,6,7)) +
theme_bw() + ggtitle("Medical Insurance Charges by Region")
It looks like charges don’t differ much across regions, but the highest medical charges are in the Southwest
Let’s check out charges by smoker status
df %>% ggplot(aes(smoker, charges)) + geom_boxplot(fill=c(3,2)) +
theme_bw() + ggtitle("Medical Insurance Charges by Smoker Status")
As you would expect, smokers spend a lot more than non-smokers when it comes to medical insurance expenses
Next, let’s look at sex
df %>% ggplot(aes(sex, charges)) + geom_boxplot(fill=c(5,7)) +
theme_bw() + ggtitle("Medical Insurance Charges by Sex")
It appears sex doesn’t seem to affect medical expenses very much
Moving on to number of children
df %>% ggplot(aes(as.factor(children), charges)) + geom_boxplot(fill=c(2:7)) +
theme_bw() + ggtitle("Medical Insurance Charges by Number of Children")
Interestingly, medical expenses seem to increase with the number of children until you have five
Let’s check out obesity now. First we need to create a new column for obese, which is greater than 30 BMI
df <- df %>%
mutate(obese = as.factor(if_else(bmi >= 30, "yes", "no")))
df %>% ggplot(aes(obese, charges)) + geom_boxplot(fill=c(6,4)) +
theme_bw() + ggtitle("Medical Charges by Obesity status (BMI 30+)")
As we can see, the average medical expenses for an obese individual is about $5000 more than non-obese people, while the median expenses betweent the group is similar
Lastly, let’s look at age, but also with smoker status
df %>% ggplot(aes(age, charges)) + geom_point(aes(color = smoker)) +
theme_bw() + ggtitle("Medical Charges by Age and Smoker Status") + scale_color_manual(values=c("red", "blue"))
Age and medical expenses don’t appear to be linear, but with the smoker status you can see the relationship between smoking and higher medical expenses
Alright, let’s move on to looking at correlation between variables
num_cols <- sapply(df,is.numeric) # using only numeric features
ggpairs(df[,num_cols]) # matrix of plots
pairs.panels(df[,num_cols]) # matrix of plots using pairs.panels
We can see that age and charges have the highest correlation among the numeric variables. It’s also interesting to note that none of our numeric values is highly correlated with each other, so multicollinearity wouldn’t be a problem.
Now let’s look at the correlation matrix for numeric features a different way
cor_mx <- cor(df[,num_cols]) # correlation matrix
cor_mx
## age bmi children charges
## age 1.0000000 0.1092719 0.04246900 0.29900819
## bmi 0.1092719 1.0000000 0.01275890 0.19834097
## children 0.0424690 0.0127589 1.00000000 0.06799823
## charges 0.2990082 0.1983410 0.06799823 1.00000000
corrplot(cor_mx, method = "color", title = "Correlation Matrix for Numeric Features",
mar=c(0,0,1,0))
Let’s move on and look at correlation between all features. But first we need to create dummy variables for sex and smoker
df_dummy <- df %>%
mutate(sex_dummy = if_else(sex == "male", 1, 0)) %>%
mutate(smoker_dummy = if_else(smoker == "yes", 1, 0))
df_dummy <- df_dummy %>%
select(1,3,4,7,9,10)
# correlation all features
ggpairs(df_dummy)
pairs.panels(df_dummy)
dummy_cor_mx <- cor(df_dummy)
corrplot(dummy_cor_mx, method = "color", title = "Correlation Matrix for All Features",
mar=c(0,0,1,0))
We can see that smoking and medical expense are highly correlated.
Let’s move on to building a model using all of the original features, except using obese instead of BMI because BMI is essentially the same thing as obese
set.seed(214)
sample <- sample.split(df$charges, SplitRatio = 0.70)
# Training Data
train = subset(df, sample == TRUE)
# Testing Data
test = subset(df, sample == FALSE)
med_model <- lm(charges ~ .-bmi, data = train)
summary(med_model)
##
## Call:
## lm(formula = charges ~ . - bmi, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12378.6 -3632.7 -411.7 1184.5 28190.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4148.16 760.41 -5.455 6.28e-08 ***
## age 269.45 14.26 18.890 < 2e-16 ***
## sexmale 57.44 402.38 0.143 0.88652
## children 502.32 164.85 3.047 0.00238 **
## smokeryes 22867.71 504.00 45.373 < 2e-16 ***
## regionnorthwest -587.99 580.23 -1.013 0.31115
## regionsoutheast -498.98 571.72 -0.873 0.38302
## regionsouthwest -1270.60 580.84 -2.188 0.02895 *
## obeseyes 3999.25 409.43 9.768 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6109 on 927 degrees of freedom
## Multiple R-squared: 0.7304, Adjusted R-squared: 0.7281
## F-statistic: 314 on 8 and 927 DF, p-value: < 2.2e-16
plot(med_model)
Just using the original variables we get a fairly good R-squared of 0.7486, which implies that about 75% of the variation in charges can be explained by the independent variables. Also, all variables except for region and sex, are statistically significant predictors of medical expense (p < .05)
Let’s attempt to improve the model
First, let’s remove region and sex, as they were not statistically significant
med_model2 <- lm(charges ~ . -bmi -sex -region, data = train)
summary(med_model2)
##
## Call:
## lm(formula = charges ~ . - bmi - sex - region, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12644.8 -3631.6 -284.8 1183.8 28264.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4711.97 646.07 -7.293 6.46e-13 ***
## age 269.60 14.26 18.900 < 2e-16 ***
## children 490.02 164.61 2.977 0.00299 **
## smokeryes 22932.73 500.57 45.813 < 2e-16 ***
## obeseyes 3975.47 402.32 9.881 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6112 on 931 degrees of freedom
## Multiple R-squared: 0.729, Adjusted R-squared: 0.7278
## F-statistic: 626.1 on 4 and 931 DF, p-value: < 2.2e-16
plot(med_model2)
This model about the same, but slightly worse, with an R-squared of 0.7476, still implying about 75% of the variation in charges can be explained by the independent variables. Let’s try a model with an interaction between obesity and smoker status
med_model3 <- lm(charges ~ obese*smoker + age + children, data = train)
summary(med_model3)
##
## Call:
## lm(formula = charges ~ obese * smoker + age + children, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19072.0 -2025.0 -1395.7 -604.1 24448.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2834.59 503.27 -5.632 2.36e-08 ***
## obeseyes 54.04 346.65 0.156 0.876
## smokeryes 13077.17 548.52 23.841 < 2e-16 ***
## age 272.33 10.99 24.779 < 2e-16 ***
## children 609.37 126.91 4.802 1.83e-06 ***
## obeseyes:smokeryes 19481.61 771.01 25.268 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4709 on 930 degrees of freedom
## Multiple R-squared: 0.8393, Adjusted R-squared: 0.8385
## F-statistic: 971.6 on 5 and 930 DF, p-value: < 2.2e-16
plot(med_model3)
This third model is much better than the first two, with an R-squared of .8503, implying about 85% of the variation in charges can be explained by the independent variables. Our Residual vs Fitted plot appears a little better, although the Normal Q-Q plot still indicates some problems with our fit.
test$predicted <- predict(med_model3, newdata = test)
test %>%
ggplot() + geom_point(aes(predicted, charges, color = smoker, shape = obese)) +
geom_abline(color = "red") + ggtitle("Prediction vs. Real Values")
It looks like our model is a pretty solid fit for our test data and our results are pretty accurate.
#calculating residuals
test$residuals <- test$charges - test$predicted
#plot residuals
test %>%
ggplot() + geom_pointrange(aes(predicted, residuals, ymin = 0, ymax = residuals, color = smoker, shape = obese)) +
geom_hline(yintercept = 0) + ggtitle("Residuals vs. Fitted Values")
Overall, the residuals look pretty good, although there are several high residuals that could be caused by factors not included in the original data set.